OpenCores
RE: Architecture
by ckavalipati on Oct 10, 2009
ckavalipati
Posts: 19
Joined: Aug 3, 2009
Last seen: Oct 30, 2012
Let us implement Gil's descriptor idea initially. We can think of other interfaces later.


RE: Architecture
by gil_savir on Oct 10, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
I have a proposal for how the flow between FW and the Encoder could work, from a high-level view. To simplify things I outline only the inbound flow into the Encoder. I wish I could provide a diagram to explain this better, but for now I have only a textual description. If anything is unclear, please let me know.

Overview:
There will be a "Video Stream Buffer" in memory that FW can write to and the Encoder can read from. The location of this Video Stream Buffer is determined by FW (as part of SoC configuration), and FW programs its location and size into the Encoder through registers. The Encoder may place certain restrictions on alignment, minimum size, etc. (TBD)

FW keeps writing frames or slices of video data into this buffer in memory, and the Encoder keeps reading video slices from it. We can have a Slice Length Register to control the size of a slice.

Synchronization between FW and the Encoder happens through two pointers called the producer index and the consumer index. FW owns the producer index and the Encoder owns the consumer index. There will be registers representing these indexes.

Flow: (I have omitted a lot of details to simplify the description)

1. At initialization time, FW programs the Video Stream Buffer's address and size registers.

(FW programs a lot more registers at initialization. It takes the core out of reset, programs the video stream properties that need to be defined, such as color space and image size, plus encoding properties, if any, and other configuration settings...)

2. When FW receives video data, it writes it to the Video Stream Buffer and updates the "Video Stream Buffer Producer Index Register".

3. The Encoder then starts reading the video data in slices and encoding it. As it consumes video data (slices), it keeps updating the consumer index at a programmed location in memory (we can have a "Video Stream Buffer Consumer Index Address Register") and generates interrupts. The consumer index is updated every time an interrupt is generated.

The Encoder will support interrupt avoidance features to throttle interrupts. FW will fine-tune them for optimal performance. (TBD)

4. FW processes the interrupts, writes some more data, and updates the producer index (through the register).

If the producer index reaches the end of the buffer, it wraps to the beginning of the buffer. The consumer index works the same way.

When the producer index and the consumer index point to the same location, the buffer is considered empty. This is the reset state.

If the producer index reaches a position where advancing it by another slice length would make it equal the consumer index, the buffer is full.

If the buffer is full, FW backs off and waits for the Encoder to consume video data, i.e., waits for an interrupt before writing more data.

Steps 2, 3, and 4 are repeated by FW during video streaming.
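The empty/full/wraparound rules in this flow can be sketched in C. The names (vsb_*) and the slice-granular indexing are assumptions for illustration, not a decided interface:

```c
/* Sketch of the producer/consumer index logic described above.
   The names and the buffer size are hypothetical; real FW would
   read/write these indexes through encoder registers. */
#include <stdint.h>

#define VSB_SIZE_SLICES 8u  /* buffer capacity in slices (assumed) */

/* Buffer empty: both indexes point to the same slice (reset state). */
static int vsb_is_empty(uint32_t prod, uint32_t cons)
{
    return prod == cons;
}

/* Buffer full: advancing the producer by one more slice would make it
   equal the consumer, so one slot is always left unused. */
static int vsb_is_full(uint32_t prod, uint32_t cons)
{
    return ((prod + 1u) % VSB_SIZE_SLICES) == cons;
}

/* Advance an index by one slice, wrapping at the end of the buffer. */
static uint32_t vsb_advance(uint32_t idx)
{
    return (idx + 1u) % VSB_SIZE_SLICES;
}
```

Keeping one slot unused is what lets the same equality test distinguish "empty" from "full".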

Hi Chander,
In principle, the flow you describe here can be implemented with FW-encoder communication through registers or descriptors, and it is not far from what I described in my previous post; only the communication is done through registers. As I said before, we can easily, and with little penalty, allow the encoder to communicate with the FW through registers AND descriptors. When the encoder reads some info from the descriptor, it must store it in local registers. Making these registers available to FW through the WB bus (WB slave i/f) would be easy. We could also put these registers in the same address order as in the descriptors, so that it won't be difficult to write FW that can deal with both descriptors and registers. It might even be designed such that FW can decide, for each slice, whether it addresses the encoder through registers or through a descriptor. For example, when FW asserts the "encode now" bit, it may choose to de/assert a "use descriptor" bit which tells the encoder whether it should look for the slice data in a descriptor or in registers.
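As a sketch of that combined scheme, assuming hypothetical bit positions for the "encode now" and "use descriptor" bits (nothing here is a decided register layout):

```c
/* Hypothetical control-register layout for the per-slice kick-off
   described above. Names and bit positions are assumptions. */
#include <stdint.h>

#define CTRL_ENCODE_NOW     (1u << 0)  /* start encoding the current slice */
#define CTRL_USE_DESCRIPTOR (1u << 1)  /* slice info in descriptor, not registers */

/* Build the control word FW would write to start one slice. */
static uint32_t ctrl_start_slice(int use_descriptor)
{
    uint32_t ctrl = CTRL_ENCODE_NOW;
    if (use_descriptor)
        ctrl |= CTRL_USE_DESCRIPTOR;
    return ctrl;
}
```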

At first glance, using descriptors looks cumbersome. With descriptors, FW writes info to memory and the encoder has to read it from memory (two bus accesses for each info element), whereas if FW writes the info directly to encoder registers, only one bus access is required per info element. However, when writing directly to registers, the FW usually has only a small time window in which to pass this info to the encoder. The SoC CPU deals with other system components apart from the encoder and might not always be available to write directly to the encoder registers within this time window. This would cause unnecessary performance delays.

Restricting the raw video input to reside in a single memory buffer decided in advance is possible, yet I don't see a reason to impose such a restriction. I think it should be left to FW to "decide". If FW "sees it best" to use a single memory chunk for inputting the raw video stream to the encoder, it may choose to do so by repeatedly reusing the same address span. This way, wraparound of the "video stream buffer" is not possible, but I don't see what benefit we get from the wraparound feature. Could you please explain it?

Could you please explain what you mean by "interrupt avoidance features"? Is it an interrupt mask?

If the producer index reaches a position where advancing it by another slice length would make it equal the consumer index, the buffer is full.

If the buffer is full, FW backs off and waits for the Encoder to consume video data, i.e., waits for an interrupt before writing more data.
Such a situation should not be possible. It suggests that the encoder cannot encode in real time, and that's unacceptable! The whole point of the encoder is to provide real-time encoding; otherwise GPU/SW implementations could be used. What should the FW do with all the frames it receives in real time from a camera controller? FW can't stall the input stream.
If you meant something that happens only under (rare) peak overload, then I think the encoder should deal with such a problem by dropping the slices/frames it didn't manage to encode yet, and go on with real-time encoding. I think the H.264 standard even allows such a situation and lets the encoder use a skip NALU, or something like that. I think we should deal with such rare situations later. Anyhow, I think the bottom line is that the encoder should be fast enough to deal with all the inputs it is designed for.

- gil
RE: Architecture
by gil_savir on Oct 10, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
Hi All,
A small idea for the (far) future. x264 is implemented using frame-level threading for better performance. I think that in the future of this project (say, after we implement a SoC with the encoder), we might enhance core performance by duplicating (a few times) the modules dealing with inter-prediction.
I don't think we should implement this now, but only bear in mind, during core development, that this option is possible in the future, and make the current implementation "future ready" in a few ways (such as reserving space in the descriptors for such a future implementation).

Please tell me what you think about it.

- gil
RE: Architecture
by jackoc on Oct 14, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010
Hi, all

I have some questions about the descriptor-based architecture. In my understanding, descriptors are used only when several transfers will be done by hardware (for example, DMA) without the intervention of software. But from the discussion above, the H.264 algorithm will be partitioned into a SW-responsible part and a HW-responsible part, that is, intervention by software is inevitable. Let me make it clear using an example:

Suppose the basic unit is one slice. Then QP will be the same for all MBs in it (since rate control in HW at the MB level would be exhausting to implement), and the QP of the next slice will be adjusted based on the return values from the current slice (such as total bits), so software inevitably gets involved. I think the operation flow to encode one slice looks as follows:

current slice:
1) SW tells global info to HW: picture width and height, frame or field coding, ...
2) SW initializes the reference list for inter frames and derives other info needed by HW...
3) SW tells other info to HW: YUV address, bitstream address, reference index, and QP...
4) SW starts HW encoding by configuring a register
5) SW reads status info: OK or error, busy or idle, total bits used and cost, to control the bitstream rate
6) SW does rate control
7) SW rearranges the decoded picture buffer (DPB)

then next slice:
...
...
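The per-slice sequence above (steps 3-5) might look roughly like this against a hypothetical register model; all names are illustrative, and a real FW would likely wait on an interrupt rather than poll:

```c
/* Sketch of steps 3-5 of the per-slice flow, using a plain struct to
   stand in for memory-mapped encoder registers. All names are
   hypothetical, not a decided register map. */
#include <stdint.h>

struct enc_regs {
    uint32_t yuv_addr, bitstream_addr, qp, start, status, total_bits;
};

#define STATUS_DONE 1u  /* assumed "encoding finished" status bit */

/* Program one slice, start the HW, wait, and return the bits used
   (which step 6, SW rate control, would consume). */
static uint32_t encode_one_slice(struct enc_regs *r,
                                 uint32_t yuv, uint32_t dst, uint32_t qp)
{
    r->yuv_addr = yuv;        /* step 3: slice inputs */
    r->bitstream_addr = dst;
    r->qp = qp;
    r->start = 1;             /* step 4: kick off HW encoding */
    while (!(r->status & STATUS_DONE))
        ;                     /* step 5: wait (real HW sets STATUS_DONE) */
    return r->total_bits;     /* fed into SW rate control */
}
```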

So, I think we should use registers instead of descriptors. Can anyone please clarify the use of descriptors?
RE: Architecture
by jackoc on Oct 14, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010

I'm trying to understand how slicing is done in the x264 code. The H.264 standard does not specify it (since it only defines decoding). If someone knows how slicing is done, or where to find such information, please share. Understanding slicing is crucial for deciding whether it should be done in FW or HW.


Slice partitioning is very complex and needs multipass encoding. I think we should leave it to FW and make the HW flexible enough to deal with SW's slice partitioning.
RE: Architecture
by jackoc on Oct 14, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010
What do most camera controllers provide? Frames?
Most camera controllers (sometimes called "frame grabbers") provide the raw video stream in frames.

The input to Encoder should be in a form that is common across Camera controllers. This avoids FW pre-processing the data.
I think Chander is right. There are painfully many formats in use (see www.fourcc.org). There is the RGB family and the YCbCr (sometimes wrongly called YUV) family. The YCbCr family is divided into packed and planar formats. As far as I know, the IYUV/I420 (planar) format is pretty popular. I suggest we start with support for this format, and later add support for more formats if we find it necessary (in FW or HW).

- gil


I agree.

We support YUV420 and leave other formats to be converted to YUV420 by other IPs.
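For reference, I420/IYUV stores a full-resolution Y plane followed by quarter-resolution Cb and Cr planes, so one frame takes w*h*3/2 bytes. A small sketch of that arithmetic, assuming even width and height:

```c
/* Plane sizes of an I420/IYUV frame: Y (w*h bytes), then Cb and Cr,
   each subsampled 2x2 (w*h/4 bytes each). Helper names are ours,
   for illustration only. */
#include <stddef.h>

/* Total bytes of one I420 frame: w*h + w*h/4 + w*h/4 = w*h*3/2. */
static size_t i420_frame_size(size_t w, size_t h)
{
    size_t y = w * h;
    return y + y / 4 + y / 4;
}

/* Byte offset of the Cb plane within the frame (right after Y). */
static size_t i420_cb_offset(size_t w, size_t h)
{
    return w * h;
}
```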
RE: Architecture
by jackoc on Oct 14, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010
I reviewed all the posts just now. From what I can see, the descriptor will contain the following:

- raw data address and size
- destination bitstream address and size
- slice info (boundaries, type, ...)
- frame info (size, start, end)
- memory location allocated for reference frames (in case the h264-encoder core will not contain dedicated storage for reference frames)
- next descriptor address (if we want the possibility of a descriptor linked list)
- an "encoding succeeded" bit that the encoder can assert when encoding terminated with no errors/exceptions
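For illustration, the fields listed above could map to a C structure like the following; the field names, widths, and ordering are assumptions, not a decided layout:

```c
/* One possible in-memory view of the descriptor fields discussed in
   this thread. Every name and width here is an assumption. */
#include <stdint.h>

struct slice_descriptor {
    uint32_t raw_addr;        /* raw video data address */
    uint32_t raw_size;        /* raw video data size */
    uint32_t bitstream_addr;  /* destination bitstream address */
    uint32_t bitstream_size;  /* destination bitstream size */
    uint32_t slice_info;      /* boundaries, type, ... */
    uint32_t frame_info;      /* size, start/end flags */
    uint32_t ref_frames_addr; /* reference-frame storage, if external */
    uint32_t next_desc_addr;  /* next descriptor (linked list), 0 = last */
    uint32_t status;          /* bit 0: "encoding succeeded" */
};
```

With all fields the same 32-bit width the structure packs without padding, which matters if HW and FW must agree on offsets.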

It seems impossible to use descriptors to decrease CPU load, because software must at least read the status info to do rate-control-related things for each slice, and reconfigure (modify) the slice info (for example QP) in each descriptor. Using descriptors makes simple things complex; I suggest using registers.

It may work if we only use descriptors to indicate the raw data/bitstream (address, size), that is, to manage the video source and bitstream destination, but then slice info cannot be included.

BR

Jack
RE: Architecture
by toanfxt on Oct 14, 2009
toanfxt
Posts: 4
Joined: Jun 24, 2008
Last seen: Sep 18, 2017

- "encoding succeeded bit" that the encoder can assert when encoding was terminated with no errors/exceptions.
Jack


I think we should use an encoder_interrupt (the Encoder generates the interrupt signal when it finishes encoding, successfully or with errors). This lets FW do other useful things (like updating the next frame, network things) while the encoder core does the slice encoding (the encoding time is quite long). When the interrupt occurs, FW will read the status register to check whether encoding succeeded or failed.
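A minimal sketch of that interrupt-driven flow, with hypothetical status bits and a plain variable standing in for the hardware status register:

```c
/* Sketch of the interrupt-driven completion handling described above:
   the ISR only reads the status register and records the result, so
   FW can do other work while the encoder runs. Names are assumptions. */
#include <stdint.h>

#define STATUS_OK  (1u << 0)  /* slice encoded successfully */
#define STATUS_ERR (1u << 1)  /* encoding terminated with an error */

static volatile uint32_t enc_status_reg;  /* stands in for the HW register */
static volatile int slice_done, slice_error;

/* Called when encoder_interrupt fires. */
static void encoder_isr(void)
{
    uint32_t status = enc_status_reg;     /* read status register */
    slice_done = 1;
    slice_error = (status & STATUS_ERR) != 0;
}
```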


RE: Architecture
by jackoc on Oct 14, 2009
jackoc
Posts: 13
Joined: Sep 5, 2009
Last seen: May 16, 2010

- "encoding succeeded bit" that the encoder can assert when encoding was terminated with no errors/exceptions.
Jack


I think we should use an encoder_interrupt (the Encoder generates the interrupt signal when it finishes encoding, successfully or with errors). This lets FW do other useful things (like updating the next frame, network things) while the encoder core does the slice encoding (the encoding time is quite long). When the interrupt occurs, FW will read the status register to check whether encoding succeeded or failed.



Yes, I agree with you; the status register and interrupt should both be implemented.

BR
Jack
RE: Architecture
by gil_savir on Oct 14, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021

- "encoding succeeded bit" that the encoder can assert when encoding was terminated with no errors/exceptions.
Jack


I think we should use an encoder_interrupt (the Encoder generates the interrupt signal when it finishes encoding, successfully or with errors). This lets FW do other useful things (like updating the next frame, network things) while the encoder core does the slice encoding (the encoding time is quite long). When the interrupt occurs, FW will read the status register to check whether encoding succeeded or failed.



Yes, I agree with you; the status register and interrupt should both be implemented.

BR
Jack

I agree. An interrupt and an interrupt status register should be implemented.

- gil
RE: Architecture
by gil_savir on Oct 14, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
I reviewed all the posts just now. From what I can see, the descriptor will contain the following:

- raw data address and size
- destination bitstream address and size
- slice info (boundaries, type, ...)
- frame info (size, start, end)
- memory location allocated for reference frames (in case the h264-encoder core will not contain dedicated storage for reference frames)
- next descriptor address (if we want the possibility of a descriptor linked list)
- an "encoding succeeded" bit that the encoder can assert when encoding terminated with no errors/exceptions

It seems impossible to use descriptors to decrease CPU load, because software must at least read the status info to do rate-control-related things for each slice, and reconfigure (modify) the slice info (for example QP) in each descriptor. Using descriptors makes simple things complex; I suggest using registers.

It may work if we only use descriptors to indicate the raw data/bitstream (address, size), that is, to manage the video source and bitstream destination, but then slice info cannot be included.

BR

Jack


If the use of descriptor chaining wouldn't contribute to decreasing CPU load, then I don't see any point in using it. However, I think it depends on whether FW needs to react to each slice, and that depends on how much work we leave to FW. If FW has to decide the QP for each slice, then I guess descriptors are out. If no intervention by FW is required at the end of each slice (i.e., QP decisions are made by HW), then we will benefit from the use of descriptors. It is a matter of decision.

- gil